Analytical Features for the Classification of Percussive Sounds: the Case of the Pandeiro
Abstract
There is an increasing need to classify sounds automatically for MIR and interactive music applications. In the context of supervised classification, we describe an approach that improves the performance of the general bag-of-frame scheme without losing its generality. This method is based on the construction and exploitation of specific audio features, called analytical, as input to classifiers. These features are better, in a sense we define precisely, than standard, general features, or even than ad hoc features designed by hand for specific problems. To construct these features, our method explores a very large space of functions by composing basic operators in syntactically correct ways. These operators are taken from the mathematical and audio processing domains. Our method allows us to build a large number of these features, and to evaluate and select them automatically for arbitrary audio classification problems. We present here a specific study concerning the analysis of Pandeiro (Brazilian tambourine) sounds. Two problems are considered: the classification of entire sounds, for MIR applications, and the classification of the attack portions of the sounds only, for interactive music applications. We evaluate precisely the gain obtained by analytical features on these two problems, in comparison with standard approaches.

1. ACOUSTIC FEATURES

Most audio classification approaches use one of two paradigms: a general scheme, called bag-of-frames, or ad hoc approaches. The bag-of-frame approach ([2], also cited in [41]) treats the signal blindly, using a systematic and general scheme: the signal is sliced into consecutive, possibly overlapping frames (typically 50 ms long), and a vector of audio features is computed from each frame. The features are supposed to represent characteristic information of the signal for the problem at hand. These vectors are then aggregated (hence the “bag”) and fed to the rest of the chain.
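The framing and aggregation steps described above can be sketched as follows. This is only an illustration under stated assumptions, not the processing chain used in the paper: the two features (zero-crossing rate and RMS energy), the 50% overlap, the averaging used for aggregation, and all function names are choices made here for the example.

```python
import math

def frames(signal, frame_len, hop):
    """Slice a signal into consecutive, possibly overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (needs >= 2 samples)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def rms(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def bag_of_frames(signal, frame_len, hop):
    """Compute one feature vector per frame, then aggregate over all frames.

    Aggregation by averaging is an assumption of this sketch; other
    statistics (variance, median, histograms) are equally possible.
    """
    vectors = [(zero_crossing_rate(f), rms(f))
               for f in frames(signal, frame_len, hop)]
    n = len(vectors)
    return tuple(sum(v[k] for v in vectors) / n for k in range(2))

# Hypothetical input: 1 s of a 440 Hz sine sampled at 8 kHz,
# cut into 50 ms frames (400 samples) with 50% overlap.
sr = 8000
tone = [math.sin(2 * math.pi * 440 * t / sr) for t in range(sr)]
mean_zcr, mean_rms = bag_of_frames(tone, frame_len=400, hop=200)
```

The resulting pair `(mean_zcr, mean_rms)` is the kind of low-dimensional vector that would be passed on to feature selection and classifier training.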
First, a subset of the available features is identified, using some feature selection algorithm. Then this feature set is used to train a classifier from a database of labeled signals (the training set). The classifier thus obtained is then usually tested against another database (the test set) to assess its performance. The use of features as input to classifiers plays two roles: a dimension reduction role and a representation role. Indeed, the signal itself could in principle be used as input to classifiers, but its dimension (number of samples) is usually too high with respect to the training set size, resulting in overfitting. Additionally, the time/amplitude representation of signals has long been acknowledged to be poorly suited to representing perceptive information: the audio features used in the classification literature aim precisely at capturing essential perceptive characteristics of audio signals that are not easily revealed in the temporal representation. One source of audio features is, for instance, MPEG7-audio ([15], or more specifically [28] or [20]) for the music domain. These features are usually of low dimensionality, and contain statistical information from the temporal domain (e.g. zero-crossing rate), the spectral domain (e.g. SpectralCentroid), or more perceptive aspects (such as sharpness, relative loudness, etc.). The bag-of-frame approach has been used extensively in the MIR domain, for instance by [32]. A large proportion of MIR-related papers has been devoted to studying the details of this processing chain: feature identification [28]; feature aggregation [34]; feature selection [26],[22],[7]; classifier comparison or tuning [1],[41].
An even larger proportion of ISMIR papers discuss the application of this approach to specific musical problems: genre classification [38],[21],[25],[39]; orchestral sound [27]; percussion instrument [37],[35],[13],[36]; tabla strokes [9],[6]; audio fingerprinting [5]; noises [12]; as well as identification tasks, such as vocal identification [18] or mood detection [19]. This approach achieves a reasonable degree of success on some problems. For instance, speech/music discrimination systems based on the bag-of-frame paradigm yield almost perfect results. However, the approach shows limitations when applied to more “difficult” problems. Although classification difficulty is hard to define precisely, it can be noted that problems involving classes with a smaller degree of abstraction are usually much more difficult to solve. For instance, genre classification works well on abstract, large categories (Jazz vs. Rock), but performance degrades for more precise classes (e.g. Be-bop vs. Hard-bop).

Proc. of the 10th Int. Conference on Digital Audio Effects (DAFx-07), Bordeaux, France, September 10-15, 2007

In these cases, the natural tendency is usually to look for ad hoc approaches, which aim at extracting “manually” from the signal the characteristics most appropriate for the problem at hand, and at exploiting them accordingly. This can be done either by defining ad hoc features integrated in the bag-of-frame approach (e.g. the 4-Hertz modulation energy used in some speech/music classifiers [32]), or by defining completely different classification schemes, e.g. the analysis-by-synthesis approach designed for drum sound classification [45], and further developed by [44] and [31]. One possible reason for the limitations of the bag-of-frame approach is that the generic features used, such as the MPEG-7 feature set, do not always capture the relevant perceptive characteristics of the signals to be classified.
Some classifier algorithms, such as kernel methods [33], including Support Vector Machines [4],[34], do try to transform the feature space with the aim of improving inter-class separability. However, the increasing sophistication of feature selection or classifier algorithms cannot compensate for a lack of information in the initial feature set. Although ad hoc approaches may indeed reach interesting performance, they are rarely reusable: ad hoc features are, by definition, problem specific. Consequently, the scientific contribution (and epistemological status) of reports of ad hoc approaches is highly debatable. In this work we try to extend the range of applications for which the general bag-of-frame approach gives satisfactory results, by proposing a mechanism that automatically invents specific ad hoc features in order to improve classification performance. To find better features than the generic ones, one can find inspiration in the way human experts actually invent ad hoc features. The papers quoted above use a number of tricks and techniques to this end, combined with intuitions and musical knowledge. For instance, one can use some front-end system to normalize a signal, pass it through some filter, or add pre- or post-processing to isolate the (hopefully) most salient characteristics of the signal. We propose here to automate this process of feature invention, using an algorithm which quickly explores a very large space of ad hoc functions. The functions are built by composing elementary operators together (in the sense of functional composition). We call these functions analytical because they are described by an explicit composition of functions, as opposed to other forms of signal reduction, such as arbitrary computer programs. This paper is structured as follows: In Section 2 we introduce the EDS system, designed to automatically create and explore large sets of analytical features.
Section 3 is devoted to the description of several experiments comparing the performance of analytical features against generic ones, on two sound classification problems for the Pandeiro (a Brazilian percussion instrument): an easy one, for MIR applications, and a more difficult one, for interactive music applications.

2. CREATION OF ANALYTIC FEATURES: THE EDS SYSTEM

EDS (Extractor Discovery System) is developed at the Sony CSL laboratory in Paris [45] to study experimentally the notion of analytical feature for audio signal processing applications. The EDS system is able to explore efficiently the space of analytical features for arbitrary supervised audio classification problems. A problem is determined by a database of audio samples labelled (usually by hand) with a finite set of classes. The exploration of the space of analytical features is based on various methods for creating functions from a set of basic operators, considered as elementary. These two aspects are described in the following sections.

2.1. A library of elementary operators

The choice of elementary operators is of course arbitrary. These operators were selected so as to allow the creation of functions with a “reasonable” degree of abstraction, i.e. functions that represent salient perceptive characteristics of the sound with a small number of operators (about 10, see below), while still allowing the creation of new, and possibly relevant, functions. These operators are either basic mathematical operations (e.g. absolute value, max, mean) or signal processing operators such as Fourier transforms, filters, Db, and spectral operators like spectralCentroid and spectralSkewness. This library also includes more specifically musical operators such as Pitch or Ltas (Long Term Average Spectrum). For the sake of reproducibility, we describe in this paper results obtained with the 76 basic operators listed in Annex 1. If we limit the size of the analytical features we create (i.e.
the number of operators used in their expression), we explore a finite function space. To give a rough idea of its size: the space of features composed of at most 5 operators contains 2,5.10 functions. In practice, we explore functions of size at most 10, which represents a space of 5.10 functions. Here are some typical examples of functions generated by EDS:

(A) Mean(Mfcc(Differentiation(x),5))
(B) Median(Rms(Split(Normalize(x),32)))

The first function (A) computes the average of the first 5 cepstral coefficients of the differentiation of the signal (represented by x). The second one (B) computes the median value (Median) of the energy (Rms) of successive frames (Split), each 32 samples long, of the normalized signal. Feature creation is controlled by two mechanisms:

1 – Each basic operator is typed according to the physical dimensions of its arguments. Typing prevents the creation of syntactically meaningless features. For instance, the Fft operator takes as input something of the “time/amplitude” type, and its output type is “frequency/amplitude”. EDS can therefore generate Fft(HpFilter(x)), but not, e.g., Fft(Max(x)).
2 – Heuristics allow the system to further avoid creating unpromising functions. E.g. one heuristic penalizes functions with too many repetitions, like Fft(Fft(Fft(x))).

In practice, adding a new basic operator to the library amounts to defining 1) the corresponding typing rules and 2) heuristics to control the use of this operator (see [24]).

2.2. Creating analytical features

The creation of analytical features by composing elementary operators is based on genetic programming search [16]. The main steps of this search are the following:

1. Construction of an initial population of analytical features, by random compositions of operators.
2. Evaluation: compute each feature on all the training signals, then use a classifier (see Section 2.3) to assess performance.
3. Iteration of the process. The next population is built from the best features found in the current population, to which are added new features obtained using various genetic transforms of the current features.

This genetic procedure explores parts of the infinite set of all analytical functions composed of basic operators. Convergence towards “meaningful” or “interesting” analytical features is not guaranteed, as this heuristic-based approach can become trapped in local minima. The genetic transforms of step 3 are the following:

Substitution: replacing one operator by another one with a compatible type. E.g. (A’) Max(Mfcc(Differentiation(x),5)) is a substitution (Max replaces Mean) of (A).
Cloning: a special case of substitution which consists in copying a feature but changing its parameters, e.g. (B’) Median(Rms(Split(Normalize(x),64))) is a clone of (B).
Mutation: an extension of substitution to sub-expressions appearing in the definition of a feature, which satisfies the typing rules: (A”) Mean(Chroma(Normalize(x))) is a mutation of (A): the sub-expression Chroma(Normalize(x)) replaces Mfcc(Differentiation(x),5).
Crossover: combining two features to create a new one while satisfying the typing rules. For instance, (C) Mean(Rms(Split(Normalize(x),32))) and (C’) Median(Rms(Split(Differentiation(x),32))) are crossovers between (A) and (B).
Addition: adding an operator to the root of a feature: (B”) Abs(Median(Rms(Split(Normalize(x),32)))) is an addition of (B).

2.3. Evaluation of features

To evaluate features, we need a computable criterion which measures the quality of a feature, i.e. its capacity to distinguish elements of different classes (labels). There are various ways to define such a criterion. The Fisher Discriminant Ratio [8] is often used because it is simple to compute and reliable for binary problems (two classes).
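For a scalar feature and two classes, one common form of the Fisher ratio divides the squared difference of the class means by the sum of the class variances. The sketch below uses that form; it is an illustration, not necessarily the exact variant used in [8] or in EDS, and the function name and sample values are made up for the example.

```python
from statistics import mean, pvariance

def fisher_ratio(values_a, values_b):
    """Two-class Fisher ratio of a scalar feature.

    Computed as (mu_a - mu_b)^2 / (var_a + var_b), using population
    variances. Larger values mean the feature separates the classes better.
    """
    between = (mean(values_a) - mean(values_b)) ** 2
    within = pvariance(values_a) + pvariance(values_b)
    if within == 0:
        # Degenerate case: the feature is constant within both classes.
        return float("inf") if between > 0 else 0.0
    return between / within

# Hypothetical feature values measured on two classes of sounds.
well_separated = fisher_ratio([1.0, 2.0, 3.0], [7.0, 8.0, 9.0])
overlapping = fisher_ratio([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A feature whose class distributions barely overlap scores much higher than one whose distributions overlap heavily, which is what makes the ratio usable as a cheap fitness measure in the two-class case.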
However, the Fisher ratio is notoriously ill-suited to multi-class problems, in particular for non-convex distributions of data. To improve feature evaluation, we chose to implement a “wrapper approach” to feature selection: features are evaluated using a classifier built during the feature search. The fitness of a feature is the performance (more precisely, the F-measure [30]) of a classifier built with this single feature and trained on the training database. This measure yields better performance than the Fisher criterion on multi-class problems.

3. PANDEIRO SOUND CLASSIFICATION

The Pandeiro is a Brazilian frame drum (a type of tambourine) used in particular in Brazilian popular music (samba, côco, capoeira, chôro). As is the case for many popular music instruments, there is no official method for playing the Pandeiro. However, the third author, a professional Pandeiro player, has developed such a method, as well as a notation for the Pandeiro, which we use in this paper. This method is based on a classification of Pandeiro sounds into exactly six categories (see Figure 1):

Tung: Bass sound, also known as open sound;
Ting: Higher pitched bass sound, also open;
PA (big pa): A slap sound, close to the Conga slap;
pa (small pa): A medium sound produced by hitting the Pandeiro head in the center. Also considered as a slap, but softer;
Tchi: The jingle sound;
Tr: A tremolo of jingle sounds.

The need for automatically analyzing Pandeiro sounds is twofold. First, MIR applications, notably for education, require the ability to automatically transcribe Pandeiro solos.